Multiple High-resolution Serum Proteomic Features for Ovarian Cancer Detection
Background 

[1001] Serum proteomic pattern analysis by mass spectrometry (MS) is an emerging 
technology that is being used to identify biomarker disease profiles. Using this MS-based 
approach, the mass spectra generated from a training set of serum samples are analyzed by 
a bioinformatic algorithm to identify diagnostic signature patterns comprised of a subset 
of key mass-to-charge (m/z) species and their relative intensities. Mass spectra from 
unknown samples are subsequently classified by likeness to the pattern found in mass 
spectra used in the training set. The number of key m/z species whose combined relative 
intensities define the pattern represents a very small subset of the entire number of species 
present in any given serum mass spectrum. 

[1002] The feasibility of using MS proteomic pattern analysis for the diagnosis of 
ovarian, breast, and prostate cancer has been demonstrated. While investigators have 
used a variety of different bioinformatic algorithms for pattern discovery, the most 
common analytical platform is comprised of a low-resolution time-of-flight (TOF) mass 
spectrometer where samples are ionized by surface enhanced laser desorption/ionization 
(SELDI), a ProteinChip array-based chromatographic retention technology that allows for 
direct mass spectrometric analysis of analytes retained on the array. 

[1003] Ovarian cancer is the leading cause of gynecological malignancy and is the 
fifth most common cause of cancer-related death in women. The American Cancer 
Society estimates that there will be 23,300 new cases of ovarian cancer and 13,900 
deaths in 2002. Unfortunately, almost 80% of women with common epithelial ovarian 
cancer are not diagnosed until the disease is advanced in stage, i.e., has spread to the 
upper abdomen (stage III) or beyond (stage IV). The 5-year survival rate for these 
women is only 15 to 20%, whereas the 5-year survival rate for ovarian cancer at stage I 
approaches 95% with surgical intervention. The early diagnosis of ovarian cancer, 
therefore, could dramatically decrease the number of deaths from this cancer. 

[1004] The most widely used diagnostic biomarker for ovarian cancer is Cancer 
Antigen 125 (CA 125) as detected by the monoclonal antibody OC 125. Though 80% of 
patients with ovarian cancer possess elevated levels of CA 125, it is elevated in only 50- 
60% of patients at stage I, lending it a positive-predictive value of 10%. Moreover, CA 
125 can be elevated in other non-gynecologic and benign conditions. A combined 
strategy of CA 125 determination with ultrasonography increases the positive-predictive 
value to approximately 20%. 

[1005] Low molecular weight serum proteomic patterns from low-resolution SELDI- 
TOF MS data can distinguish neoplastic from non-neoplastic disease within the ovary. 
See Petricoin, E. F. III et al. Use of proteomic patterns in serum to identify ovarian 
cancer. The Lancet 359, 572-577 (2002). The proteomic patterns can be identified by 
application of an artificial intelligence bioinformatics tool that employs an unsupervised 
system (self-organizing cluster mapping) as a fitness test for a supervised system (a 
genetic algorithm). A training set comprised of SELDI-TOF mass spectra from serum 
derived from either unaffected women or women with ovarian cancer is employed so that 
the most fit combination of m/z features (along with their relative intensities) plotted in n-
space can reliably distinguish the cohorts used in training. The "trained" algorithm was 
applied to a masked set of samples, resulting in a sensitivity of 100% and a specificity 
of 95%. This technique is described in more detail in WO 02/06829A2 "A Process for 
Discriminating Between Biological States Based on Hidden Patterns from Biological 
Data" ("Hidden Patterns"), the disclosure of which is hereby expressly incorporated 
herein by reference. 
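
As a rough illustration of classifying an unknown spectrum by likeness to trained patterns, the sketch below assigns a normalized feature vector to the nearest trained node in n-space and reports that node's state, or no classification if no node lies within a distance threshold. The Euclidean metric, the threshold value, the node coordinates, and all names (`classify`, `NODES`, `max_distance`) are assumptions made for illustration only and are not taken from the Hidden Patterns disclosure.

```python
import math

def classify(sample, nodes, max_distance):
    """Return the state of the nearest node, or None if no node is close enough.

    sample: tuple of normalized intensities, one per key m/z feature.
    nodes: list of (centroid, state) pairs; state 1 = ovarian cancer, 0 = normal.
    max_distance: assumed likeness threshold expressed as a Euclidean distance.
    """
    best_state, best_dist = None, float("inf")
    for centroid, state in nodes:
        d = math.dist(sample, centroid)  # Euclidean distance in n-space
        if d < best_dist:
            best_state, best_dist = state, d
    return best_state if best_dist <= max_distance else None

# Hypothetical trained nodes over three features (values are made up).
NODES = [
    ((0.10, 0.80, 0.30), 0),  # a "normal" node
    ((0.70, 0.20, 0.90), 1),  # an "ovarian cancer" node
]

print(classify((0.12, 0.78, 0.33), NODES, max_distance=0.2))
print(classify((0.68, 0.22, 0.88), NODES, max_distance=0.2))
print(classify((0.40, 0.50, 0.60), NODES, max_distance=0.2))
```

A sample that falls outside every node's threshold is left unclassified rather than forced into a cohort, which is one plausible way to handle spectra unlike anything seen in training.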

[1006] Although this technique works well, the low-resolution mass spectrometry 
instrumentation and thus the data that comes from the instrument may limit the attainable 
reproducibility, sensitivity, and specificity for proteomic pattern analyses for routine 
clinical use. 

Summary 

[1007] The protein pattern analysis concept of Hidden Patterns is extended to a high- 
resolution MS platform to generate diagnostic models possessing higher sensitivities and 
specificities on a format that generates more stable spectra, has a true time-of-flight mass 
accuracy, and is inherently more reproducible machine-to-machine and day-to-day 
because of the increase in mass accuracy. Sera from a large, well-controlled ovarian 
cancer screening trial were used and proteomic pattern analysis was conducted on the 
same samples on two mass spectral platforms differing in their effective resolution and 
mass accuracy. The data was analyzed so as to rank the sensitivity and specificity of the 
series of diagnostic models that emerged. 

[1008] The spectra from a high-resolution and a low-resolution mass spectrometer 
with the same patients' sera samples applied and analyzed on the same SELDI 
ProteinChip arrays were compared. Although the higher resolution mass spectra may 
generate more distinguishable sets of diagnostic features, the increased complexity and 
dimensionality of data may reduce the likelihood of fruitful pattern discovery. 
Diagnostic proteomic feature sets can be discerned within the high-resolution spectra 
from the clinically relevant patient study set, and the modeling outcomes between the two 
instrument platforms can be compared. The number and character of the diagnostic 
models emerging from data mining operations can be ranked. Serum proteomic pattern 
analysis can be used for the generation of multiple, highly accurate models using a hybrid 
quadrupole time-of-flight (Qq-TOF) MS for an improved early diagnosis of ovarian 
cancer. 

Brief Description of Figures 

[1009] FIGS. 1A and 1B compare the mass spectra from control serum prepared on a 
WCX2 ProteinChip array and analyzed with a PBS-II TOF (panel A) or a Qq-TOF 
(panel B) mass spectrometer. 

[1010] FIGS. 2A and 2B show histograms representing the testing results of 
sensitivity (2A) and specificity (2B) of 108 models for MS data acquired on either a Qq-
TOF or a PBS-II TOF mass spectrometer. 

[1011] FIGS. 3A and 3B show histograms representing the testing and blinded 
validation results of sensitivity (3A) and specificity (3B) of 108 models for MS data 
acquired on either a Qq-TOF or a PBS-II TOF mass spectrometer. 

[1012] FIGS. 4A and 4B compare SELDI Qq-TOF mass spectra of serum from an 
unaffected individual (4A) and an ovarian cancer patient (4B). 

Detailed Description 

Analysis of Serum Samples 

[1013] A total of 248 serum samples were provided from the National Ovarian 
Cancer Early Detection Program (NOCEDP) clinic at Northwestern University Hospital 
(Chicago, Illinois). The samples were processed and their proteomic patterns acquired by 
MS as described below in the description of the methods used. The serum samples in the 
present study were analyzed on the same protein chip arrays by both a PBS-II and a Qq- 
TOF MS fitted with a SELDI ProteinChip array interface. While the spectra acquired 
from both instruments are qualitatively similar, the higher resolution afforded by the Qq- 
TOF MS is apparent from FIG. 1. This increased resolution allows species close in m/z 
unresolved by the PBS-II TOF MS to be distinctly observed in the Qq-TOF mass 
spectrum. Indeed, simulations demonstrate the ability of the Qq-TOF MS (routine 
resolution ~8000) to completely resolve species differing in m/z of only 0.375 (e.g., at 
m/z 3000), whereas complete resolution of species with the PBS-II TOF MS (routine 
resolution ~150) is only possible for species that differ by m/z of 20 (simulation not 
shown). 
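
The two spacing figures quoted above follow from the usual definition of resolution, R = m / Δm, so the smallest resolvable spacing at a given m/z is m / R. The short sketch below works this through for the routine resolutions stated in this description (about 8000 for the Qq-TOF MS and about 150 for the PBS-II TOF MS). The m/z of 3000 is an illustrative assumption, chosen because it reproduces the 0.375 and 20 figures; the function name is hypothetical.

```python
def min_resolvable_spacing(mz, resolution):
    """Smallest m/z spacing resolved at a given m/z, assuming R = m / delta_m."""
    return mz / resolution

MZ = 3000.0  # illustrative m/z (assumption)

for name, resolution in (("Qq-TOF", 8000.0), ("PBS-II TOF", 150.0)):
    spacing = min_resolvable_spacing(MZ, resolution)
    print(f"{name}: species must differ by at least {spacing:g} in m/z")
```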

[1014] The mass spectra were analyzed using the ProteomeQuest™ bioinformatics 
tool employing ASCII files consisting of m/z and intensity values of either the PBS-II 
TOF or the Qq-TOF mass spectra as the input. The mass spectral data acquired using the 
Qq-TOF MS were binned to precisely define the number of features in each spectrum 
to 7,084 with each feature being comprised of a binned m/z and amplitude value. The 
algorithm examines the data to find a set of features at precise binned m/z values whose 
combined, normalized relative intensity values in n-space best segregate the data derived 
from the training set. Mass spectra acquired on the Qq-TOF and the PBS-II TOF 
instruments from the same sample sets were restricted to the m/z range from 700 to 
11,893 for direct comparison between the two platforms. The entire set of spectra 
acquired from the serum samples was divided into three data sets: a) a training set that is 
used to discover the hidden diagnostic patterns, b) a testing set, and c) a validation set. 
With this approach only the normalized intensities of the key subset of m/z values 
identified using the training set were used to classify the testing and validation sets, and 
the algorithm had not previously "seen" the spectra in the testing and validation sets. 

WO 2005/011474    PCT/US2004/024413 

[1015] The training set was comprised of serum from 28 unaffected women and 56 
women with ovarian cancer. The training and testing set mass spectra were analyzed by 
the bioinformatic algorithm to generate a series of models under the following set of 
modeling parameters: a) a similarity space of 85%, 90%, or 95% likeness for cluster 
classification; b) a feature set size of 5, 10, or 15 random m/z values whose combined 
intensities comprise each pattern; and c) a learning rate of 0.1%, 0.2%, or 0.3% for 
pattern generation by the genetic algorithm. Four sets of randomly generated models for 
each of the 27 permutations were derived and queried with the same test set. Sensitivity 
and specificity testing results for each of the 108 models (four rounds of training for each 
of the 27 permutations) were generated, as shown in FIGS. 2A and 2B. These results 
demonstrate that the Qq-TOF MS data produced better results than the lower resolution 
instrument (P < 0.00001, using the exact Cochran-Armitage test for trend; see Agresti, A., 
Categorical Data Analysis, New York: John Wiley and Sons (1990)) throughout a range of 
modeling conditions. 
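
The count of 108 models follows from the parameter grid described above: three values for each of three parameters give 27 permutations, and four rounds of training per permutation give 108 runs. A minimal sketch of enumerating that grid is shown below; the variable names are hypothetical and the training step itself is not represented.

```python
from itertools import product

# Parameter values as listed in the description above.
SIMILARITY_SPACE = (0.85, 0.90, 0.95)   # likeness for cluster classification
FEATURE_SET_SIZES = (5, 10, 15)         # number of m/z values per pattern
LEARNING_RATES = (0.001, 0.002, 0.003)  # 0.1%, 0.2%, 0.3%
ROUNDS = 4                              # randomly generated model sets per permutation

permutations = list(product(SIMILARITY_SPACE, FEATURE_SET_SIZES, LEARNING_RATES))
runs = [(params, rnd) for params in permutations for rnd in range(1, ROUNDS + 1)]

print(len(permutations), len(runs))  # 27 108
```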

[1016] The ability to generate the best performing models for testing and validation 
was statistically evaluated as multiple models were generated and ranked using the entire 
range of the modeling parameters above. Models from the training set were validated 
using a testing set consisting of 31 unaffected and 63 ovarian cancer serum samples. To 
further validate the ability to diagnose ovarian cancer, a set of blinded sample mass 
spectra consisting of an additional 37 normal and 40 ovarian cancer serum mass spectra 
were tested against the model found in training previously discussed. As shown in 
FIGS. 3A and 3B, the results show the ability of the mass spectra from the higher 
resolution Qq-TOF MS to generate statistically significant (P < 0.00001) superior models 
over the lower resolution PBS-II mass spectra. 

[1017] Fifteen models were found that were 100% sensitive in their ability to 
correctly discriminate unaffected women from those suffering from ovarian cancer, that 
were 100% specific in discriminating women in the test set, and at least 97% specific in 
the validation set. These models are shown in Appendix A, and identified as Model 1 
through Model 15. Of these models, four were found that were both 100% sensitive and 
specific for both sets (Models 4, 9, 10, and 15). 

[1018] Appendix A identifies for each model the following information. First the 
specificity and sensitivity for each model is shown for the Test set and for the Validity 
set. The number of samples for which the model correctly grouped women with a 
"Normal State" (i.e. not having ovarian cancer) and with an "Ovarian Cancer State" is 
then shown for each of the test and validity tests, compared to the total number of 
samples in the corresponding sets. For example, in Model 1, the model correctly 
identified 36 of the 37 women as having a normal state in the Validity set. 
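
The sensitivity and specificity figures reported for each model can be derived from these counts. As a minimal sketch using the Model 1 example above (36 of 37 normal samples correctly identified in the Validity set, and 40 of 40 ovarian cancer samples for a model that is 100% sensitive), the function names below are hypothetical:

```python
def sensitivity(true_pos, false_neg):
    """Fraction of diseased samples that the model calls diseased."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of normal samples that the model calls normal."""
    return true_neg / (true_neg + false_pos)

# Model 1, Validity set: 36 of 37 normal samples grouped as "Normal State".
spec = specificity(true_neg=36, false_pos=1)
# A 100% sensitive model on the 40 ovarian cancer samples of the Validity set.
sens = sensitivity(true_pos=40, false_neg=0)

print(f"specificity = {spec:.1%}, sensitivity = {sens:.1%}")
# specificity = 97.3%, sensitivity = 100.0%
```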

[1019] Finally, for each model a table is set forth showing the constituent "patterns" 
comprising the model. Each pattern corresponds to a point, or node, in the N- 
dimensional space defined by the N m/z values (or "features") included in the model. 
Appendix A therefore shows for each model a table containing the constituent patterns, 
each pattern being in a row identified by a "Node" number. The table also includes 
columns for the 
constituent features of the patterns, with the m/z value for each pattern identified at the 
top of the column. The amplitudes are shown for each feature, for each pattern, and are 
normalized to 1.0. The remaining four columns in each table are labeled "Count," 
"State," "StateSum," and "Error." "Count" is the number of samples in the Training set 
that correspond to the identified node. "State" indicates the state of the node, where 1 
indicates diseased (in this case, having ovarian cancer) and 0 indicates normal (not 
having the disease). "StateSum" is the sum of the state values for all of the correctly 
classified members of the indicated node, while "Error" is the number of incorrectly 
classified members of the indicated node. Thus, for node 5 in Model 1, 13 samples were 
assigned to the node, whereas 11 samples were actually diseased. StateSum is thus 11 
(rather than 13) and Error is 2. 
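
The bookkeeping for a single node can be sketched as follows, using the node 5 example from Model 1. The function name and the representation of the node's members as a list of true states are assumptions for illustration; only the definitions of Count, State, StateSum, and Error come from the description above.

```python
def node_summary(node_state, member_states):
    """Summarize one node of a model table.

    node_state: 1 if the node is a diseased node, 0 if normal.
    member_states: true state (1 or 0) of each Training-set sample assigned to the node.
    """
    correct = [s for s in member_states if s == node_state]
    return {
        "Count": len(member_states),
        "State": node_state,
        "StateSum": sum(correct),  # sum of state values of correctly classified members
        "Error": len(member_states) - len(correct),
    }

# Node 5 in Model 1: 13 samples assigned, 11 of them actually diseased.
members = [1] * 11 + [0] * 2
print(node_summary(1, members))
# {'Count': 13, 'State': 1, 'StateSum': 11, 'Error': 2}
```

Under this reading, a node with State 0 would always have a StateSum of 0; whether the tables in Appendix A follow that convention is not confirmed here.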

[1020] Examination of the key m/z features that comprise the four best performing 
models (Models 4, 9, 10, and 15) reveals certain features (i.e., contained within m/z 
bins 7060.121, 8605.678 and 8706.065) that are consistently present as classifiers in 
those models. 

[1021] Although the proteomic patterns generated from both healthy and cancer 
patients using the Qq-TOF MS are quite similar (as seen by comparing FIGS. 4A to 4B), 
careful inspection of the raw mass spectra reveals that peaks within the binned m/z values 
7060.121 and 8605.678 are differentially abundant in a selection of the serum samples 
obtained from ovarian cancer patients as compared to unaffected individuals and that the 
features that the ProteomeQuest™ software selected are "real" features and not noise. 
The insets in FIGS. 4A and 4B show expanded m/z regions highlighting significant 
intensity differences of the peaks in the m/z bins 7060.121 and 8605.678 (indicated by 
brackets) identified by the algorithm as belonging to the optimum discriminatory pattern. 
These results indicate these MS peaks originate from species that may be consistent 
indicators of the presence of ovarian cancer. The ability to distinguish sera from an 
unaffected individual or an individual with ovarian cancer based on a single serum 
proteomic m/z feature alone is not possible across the entire serum study set. 
While a single key m/z species is insufficient to globally distinguish all of the unaffected 
and ovarian cancer patients, taken together the combined peak intensities of key ions do 
allow the two data sets to be completely distinguished. 

[1022] The four best performing models that are 100% sensitive and specific for the 
blinded testing and validation tests were chosen for further analysis. Table 1 shows 
bioinformatic classification results of serum samples from masked testing and validation 
sets by proteomic pattern classification using the best performing models. 

                                    Actual    Predicted (%) 
Benign / Unaffected                 68        68 (100) 
Ovarian Cancer Stage I              22        22 (100) 
Ovarian Cancer Stage II, III, IV    81        81 (100) 

Table 1 

Each of these models was able to successfully diagnose the presence of ovarian cancer in 
all of the serum samples from affected women. Further, no false positive or false 
negative classifications occurred with these best performing models. 

Discussion 

[1023] A limitation of individual cancer biomarkers is the lack of sensitivity and 
specificity when applied to large heterogeneous populations. Biomarker pattern analysis 
seeks to overcome the limitation of individual biomarkers. Serum proteomic pattern 
analysis can provide new tools for early diagnosis, therapeutic monitoring and outcome 
analysis. Its usefulness is enhanced by the ability of a selected set of features to 
transcend the biologic heterogeneity and methodological background "noise." This 
diagnostic goal is aided by employing a genetic algorithm coupled with a self-organizing 
cluster analysis to discover diagnostic subsets of m/z features and their relative intensities 
contained within high-resolution Qq-TOF mass spectral data. 

[1024] It is believed that diagnostic serum proteomic feature sets exist within 
constellations of small proteins and peptides. A given signature pattern reflects changes 
in the physiologic or pathologic state of a target tissue. With regard to cancer markers, it 
is believed that serum diagnostic patterns are a product of the complex tumor-host 
microenvironment and are derived from multiple modified host proteins rather than 
emanating exclusively from the 
cancer cells. The biomarker profile may be amplified by tumor-host interactions. This 
amplification includes, for example, the generation of peptide cleavage products by 
tumor or host proteases. There may exist multiple dependent, or independent, sets of 
proteins/peptides that reflect the underlying tissue pathology. Hence, the disease related 
proteomic pattern information content in blood might be richer than previously 
anticipated. Rather than a single "best" feature set, multiple proteomic feature sets may 
exist that achieve highly accurate discrimination and hence diagnostic power. This 
possibility is supported by the data described above. 

[1025] The low molecular weight serum proteome is an unexplored archive, even 
though this is the mass region where MS is best suited for analysis. It is thought likely 
that disease-associated species are comprised of low molecular weight peptide/protein 
species that vary in mass by as little as a few Daltons. Thus a higher resolution mass 
spectrometer would be expected to discriminate and discover patterns not resolvable by a 
lower resolution instrument. The spectra produced by a Qq-TOF MS were compared to 
those of the Ciphergen PBS-II TOF MS. The routine resolution obtained is in excess of 
8000 (at m/z = 1500) for the Qq-TOF MS and 150 (at m/z = 1500) for the PBS-II TOF 
mass spectrometer. A SELDI source was used so that both instruments analyzed the 
same sample on distinct regions of the protein chip array bait surface. While the overall 
spectral profile is similar, a single peak on the PBS-II TOF MS is resolved into a 
multitude of peaks on the Qq-TOF MS (seen by comparing FIGS. 1A and 1B to FIGS. 4A 
and 4B). Moreover, higher resolution instrumentation that uncouples the mass analyzer 
from the source offers an inherent increase in mass accuracy. It will provide cleaner 
spectra by suppressing confounding metastable ions and will show lower mass drift over 
time and across instruments, while at the same time generating more complex, highly 
resolved data. 

[1026] In the first phase of comparison, proteomic patterns from mass spectra derived 
from the same training sets and generated on the high and low-resolution mass 
spectrometers were compared under modeling constraints in which patterns were 
generated using three different degrees of 
similarity space for the self-organizing clusters to form, three different sets of feature 
sizes chosen, and three different mutation rates for a total of 27 modeling permutations. 
Sensitivity and specificity testing results for each of the 108 models (shown in FIGS. 2A 
and 2B), produced from four rounds of training for each of the 27 permutations, 
demonstrate that the Qq-TOF MS generated spectra consistently outperformed the lower 
resolution TOF-MS spectra (P < 0.00001) independent of the modeling criteria used. 

[1027] Since the spectra from the higher resolution platform generate patterns with a 
higher level of sensitivity and specificity, those spectra could generate more accurate 
models with a higher degree of sensitivity and specificity - that is, generate the best 
diagnostic models. These results were generated using even more stringent criteria, in 
that an additional masked validation set was employed after testing to determine overall 
accuracy. The higher resolution spectra consistently produced significantly more 
accurate models as seen in both the testing and validation studies (as shown in FIGS. 3A 
and 3B). The models derived from the Qq-TOF MS were consistently more sensitive and 
specific (P < 0.00001) than those from the PBS-II TOF MS. Four models were generated 
that attained 100% sensitivity and specificity in both testing and validation. The number 
of key m/z values used as classifiers in the four best diagnostic models ranged from 5 
to 9. Three m/z bin values were found in two of these four models and two m/z bins were 
found in three of the four best models. The distinct peaks present in the recurring m/z 
bins 7060.121, 8605.678 and 8706.065 may be good candidates for low molecular weight 
components in serum that may be key disease progression indicators. 

[1028] These data support the existence of multiple highly accurate and distinct 
proteomic feature sets that can accurately distinguish ovarian cancer. To screen for 
diseases of relatively low prevalence, such as ovarian cancer, a diagnostic test preferably 
exceeds 99% sensitivity and specificity to minimize false positives, while correctly 
detecting early stage disease when it is present. As discussed above, four models 
generated using high-resolution Qq-TOF MS data achieved 100% sensitivity and 
specificity. In blinded testing and validation studies any one of these models was used 
to correctly classify 22/22 ovarian cancer stage I, 81/81 ovarian cancer stage II, III and 
IV, and 68/68 benign disease controls. 
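
The reason a screening test for a low-prevalence disease needs very high specificity can be seen from the standard relation between positive-predictive value (PPV), sensitivity, specificity, and prevalence. The sketch below is illustrative only: the prevalence of 1 in 2,500 is an assumed value chosen for the example and is not a figure from this study, and the function name is hypothetical.

```python
def positive_predictive_value(sens, spec, prevalence):
    """PPV = true positives / all positives, from Bayes' rule."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

ASSUMED_PREVALENCE = 1 / 2500  # illustrative assumption, not from the study

for spec in (0.95, 0.99, 0.999, 1.0):
    ppv = positive_predictive_value(1.0, spec, ASSUMED_PREVALENCE)
    print(f"specificity {spec:.1%}: PPV {ppv:.1%}")
```

Even with perfect sensitivity, the false positives drawn from the large unaffected population dominate the positive calls unless specificity is very close to 100%.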

[1029] Thus, a clinical test could simultaneously employ several combinations of 
highly accurate diagnostic proteomic patterns arising concomitantly from the same data 
streams, which, taken together, could achieve an even higher degree of accuracy in a 
screening setting where a diagnostic test will face large population heterogeneity and 
potential variability in sample quality and handling. Hence, a high-resolution system, 
such as the Qq-TOF MS employed in this study, is preferred based on the present results. 

Methods 

[1030] Serum Samples: Serum samples were obtained from the National Ovarian 
Cancer Early Detection Program (NOCEDP) clinic at Northwestern University Hospital 
(Chicago, Illinois). Two hundred and forty eight samples were prepared using a Biomek 
2000 robotic liquid handler (Beckman Coulter, Inc., Palo Alto, California). All analyses 
were performed using ProteinChip weak cation exchange interaction chips (WCX2, 
Ciphergen Biosystems Inc., Fremont, California). A control sample was randomly 
applied to one spot on each protein array as a quality control for sample preparation and 
mass spectrometer function. The control sample, SRM 1951A, which is comprised of 
pooled human sera, was provided by the National Institute of Standards and Technology 
(NIST). 

[1031] Sample Preparation: WCX2 ProteinChip arrays were processed in parallel 
using a Biomek Laboratory workstation (Beckman-Coulter) modified to make use of a 
ProteinChip array bioprocessor (Ciphergen Biosystems Inc.). The bioprocessor holds 12 
ProteinChips, each having 8 chromatographic "spots", allowing 96 samples to be 
processed in parallel. One hundred µl of 10 mM HCl was applied to the WCX2 protein 
arrays and allowed to incubate for 5 minutes. The HCl was aspirated, discarded and 100 
µl of distilled, deionized water (ddH2O) was applied and allowed to incubate for 1 
minute. The ddH2O was aspirated, discarded, and reapplied for another minute. One 
hundred µl of 10 mM NH4HCO3 with 0.1% Triton X-100 was applied to the surface and 
allowed to incubate for 5 minutes after which the solution was aspirated and discarded. 
A second application of 100 µl of 10 mM NH4HCO3 with 0.1% Triton X-100 was 
applied and allowed to incubate for 5 minutes after which the ProteinChip array bait 
surfaces were aspirated. Five µl of raw, undiluted serum was applied to each ProteinChip 
WCX2 bait surface and allowed to incubate for 55 minutes. Each ProteinChip array was 
washed 3 times with Dulbecco's phosphate buffered saline (PBS) and ddH2O. For each 
wash, 150 µl of either PBS or ddH2O was sequentially dispensed, mixed by aspirating, 
and dispensed for a total of 10 times in the bioprocessor after which the solution was 
aspirated to waste. This wash process was repeated for a total of 6 washes per 
ProteinChip array bait surface. The ProteinChip array bait surfaces were vacuum dried to 
prevent cross contamination when the bioprocessor gasket was removed. After removing 
the bioprocessor gasket, 1.0 µl of a saturated solution of α-cyano-4-hydroxycinnamic 
acid in 50% (v/v) acetonitrile, 0.5% (v/v) trifluoroacetic acid was applied to each spot on 
the ProteinChip array twice, allowing the solution to dry between applications. 

[1032] PBS-II Analysis: ProteinChip arrays were placed in the Protein Biological 
System II time-of-flight mass spectrometer (PBS-II, Ciphergen Biosystems Inc.) and 
mass spectra were recorded using the following settings: 195 laser shots/spectrum 
collected in positive mode, laser intensity 220, detector sensitivity 5, detector voltage 
1850, and a mass focus of 6,000 Da. The PBS-II was externally calibrated using the 
"All-In-One" peptide mass standard (Ciphergen Biosystems, Inc.). 

[1033] Qq-TOF MS Analysis: ProteinChip arrays were analyzed using a hybrid 
quadrupole time-of-flight mass spectrometer (QSTAR pulsar i, Applied Biosystems Inc., 
Framingham, Massachusetts) fitted with a ProteinChip array interface (Ciphergen 
Biosystems Inc., Fremont, California). Samples were ionized with a 337 nm pulsed 
nitrogen laser (ThermoLaser Sciences model VSL-337-ND-S, Waltham, Massachusetts) 
operating at 30 Hz. Approximately 20 mTorr of nitrogen gas was used for collisional ion 
cooling. Each spectrum represents 100 multi-channel averaged scans (1.667 min 
acquisition/spectrum). The mass spectrometer was externally calibrated using a mixture 
of known peptides. 

[1034] The data were prepared by exporting the raw data file generated from the Qq-TOF 
mass spectrum into a tab-delimited 
format that generated approximately 350,000 data points per spectrum. The 
data files were binned using a function of 400 parts per million (ppm) such that all data 
files possess identical m/z values (e.g., the m/z bin sizes linearly increased from 0.28 at 
m/z 700 to 4.75 at m/z 12,000). The intensities in each 400 ppm bin were summed. This 
binning process condenses the number of data points to exactly 7,084 points per sample. 
The binned spectral data were separated into approximately three equal groups for 
training, testing and blind validation. The training set consisted of 28 normal and 56 
ovarian cancer samples. The models were built on the training set using 
ProteomeQuest™ (Correlogic Systems Inc., Bethesda, Maryland) and validated using the 
testing samples, which consisted of 30 normal and 57 ovarian cancer samples. The 
model was validated using blinded samples, which consisted of 37 normal and 40 ovarian 
cancer samples. The m/z values that were found to be classifiers used to distinguish 
serum from a patient with ovarian cancer from that of an unaffected individual are based 
on the binned data and not the actual m/z values from the raw mass spectra. 
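
The binning step can be sketched as follows. The sketch assumes that each bin edge is 400 ppm above the previous one (so bin width grows in proportion to m/z) and uses the 700 to 11,893 m/z range stated earlier in this description; the exact edge construction used by the authors is not given, and the function names and the three sample data points are hypothetical.

```python
from bisect import bisect_right

PPM = 400e-6  # 400 parts per million

def make_bin_edges(mz_min, mz_max, ppm=PPM):
    """Build bin edges so that each bin is `ppm` wide relative to its lower edge."""
    edges = [mz_min]
    while edges[-1] < mz_max:
        edges.append(edges[-1] * (1.0 + ppm))
    return edges

def bin_spectrum(points, edges):
    """Sum the intensities of (m/z, intensity) points falling in each bin."""
    sums = [0.0] * (len(edges) - 1)
    for mz, intensity in points:
        if edges[0] <= mz < edges[-1]:
            sums[bisect_right(edges, mz) - 1] += intensity
    return sums

edges = make_bin_edges(700.0, 11893.0)
first_width = edges[1] - edges[0]
last_width = edges[-1] - edges[-2]

# Three made-up raw data points; the first two share the first bin.
binned = bin_spectrum([(700.1, 2.0), (700.2, 3.0), (5000.0, 1.0)], edges)

print(len(binned), round(first_width, 2), round(last_width, 2))
```

Because every spectrum is mapped onto the same fixed set of edges, all binned files share identical m/z values, and the number of bins produced by this construction is close to the 7,084 points per sample reported above.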

[1035] Statistical significance of the results generated using the Qq-TOF and PBS-II 
MS was assessed using the exact Cochran-Armitage test for trend to compare the 
distributions of the specificity and sensitivity values between the two instrumental 
platforms evaluated, since the models are constructed independently from each other. 

What is claimed is: 

1. A model usable in determining whether a biological sample taken from a subject 
indicates that the subject has ovarian cancer, comprising: 

a vector space having at least three dimensions; and 

at least one diagnostic cluster defined in said vector space, said diagnostic cluster 
corresponding to one of a diseased cluster and a healthy cluster, 

said vector space having a first dimension that corresponds to a first mass to 
charge ratio value from a mass spectrum, said first mass to charge ratio being about 7060, 
said vector space having a second dimension that corresponds to a second mass to charge 
ratio value from a mass spectrum, said second mass to charge ratio being about 8605, and 
said vector space having a third dimension that corresponds to a third mass to charge 
ratio value from a mass spectrum, said third mass to charge ratio being about 8706. 

2. The model of claim 1, wherein the vector space has at least four dimensions, said 
vector space having a fourth dimension that corresponds to a fourth mass to charge ratio 
value from a mass spectrum, said fourth mass to charge ratio being about 6548. 

3. A model usable in determining whether a biological sample taken from a subject 
indicates that the subject has ovarian cancer, comprising: 

a vector space having at least three dimensions; and 

at least one diagnostic cluster defined in said vector space, said diagnostic cluster 
corresponding to one of a diseased cluster and a healthy cluster, 

said vector space having a first dimension that corresponds to a first mass to 
charge ratio value from a mass spectrum, said first mass to charge ratio being about 9807, 
said vector space having a second dimension that corresponds to a second mass to charge 
ratio value from a mass spectrum, said second mass to charge ratio being about 2374, and 
said vector space having a third dimension that corresponds to a third mass to charge 
ratio value from a mass spectrum, said third mass to charge ratio being about 1276. 

4. The model of claim 3, wherein the vector space has at least four dimensions, said 
vector space having a fourth dimension that corresponds to a fourth mass to charge ratio 
value from a mass spectrum, said fourth mass to charge ratio being about 4292. 

5. A method of determining whether a biological sample taken from a subject 
indicates that the subject has ovarian cancer by analyzing the biological sample to obtain 
a data stream that describes the biological sample, comprising: 

a. abstracting the data stream to produce a sample vector that 
characterizes the data stream in a predetermined vector space containing a 
diagnostic cluster, the diagnostic cluster being an ovarian cancer cluster, the 
ovarian cancer cluster corresponding to the presence of ovarian cancer; 

b. determining whether the sample vector rests within the ovarian 
cancer cluster; and 

c. if the sample vector rests within the ovarian cancer cluster, 
identifying the biological sample as being taken from a subject that has ovarian 
cancer. 